Acknowledgement

Grammarly, an AI-powered writing assistant tool, enhanced the clarity, coherence, and grammatical accuracy of initial notes and the final draft. Its assistance in identifying errors and suggesting style and vocabulary improvements significantly contributed to the quality of this report.

Disclaimer

This project is presented in a single R notebook for demonstration purposes. In a professional setting, a modular structure is recommended to follow best practices and improve maintainability. Data preprocessing, feature engineering, and model training would typically be separated into dedicated scripts or notebooks, with clear management of training, validation, and test datasets to enhance reproducibility and reduce overfitting.

Introduction

The “17K Mobile Strategy Games” dataset from Kaggle provides comprehensive information about mobile strategy games, including attributes such as average user ratings, price, age rating, app size, release dates, supported languages, and genres. This dataset is an excellent foundation for exploring patterns and relationships that drive user engagement and satisfaction in mobile games. Its diverse range of features allows for an in-depth analysis of the factors influencing user satisfaction.

The initial hypothesis assumes that attributes such as a higher number of user ratings, positive sentiment in app descriptions, popular genres, app size, and support for multiple languages are associated with higher satisfaction levels, as reflected in Average User Rating. Furthermore, given the complex relationships between predictors and ratings, user satisfaction is likely influenced by a combination of factors, with the variables interacting in subtle and intricate ways. A secondary hypothesis suggests that apps with high user engagement, as indicated by a larger User Rating Count, tend to have more stable and moderate ratings, whereas apps with extreme average ratings (either very high or very low) are likely to have fewer reviews, making their averages less reliable because they rest on limited feedback.

XGBoost was chosen as the modelling algorithm due to its exceptional performance in predictive modelling tasks and its dominance in structured data challenges, particularly in Kaggle competitions. Known for its efficiency and ability to handle complex interactions, XGBoost is well-suited for this analysis, offering a robust platform for both numeric prediction and feature importance analysis, outperforming other methods like deep neural networks on structured data (Lantz, 2023). By leveraging this technique, the report aims to uncover actionable insights and build a reliable model to understand and predict user satisfaction in mobile games.

Data Processing

Data Overview and Preprocessing

The original dataset consists of the following features:

  1. URL: The URL to the app’s page in the App Store
  2. ID: The app’s assigned ID
  3. Name: The game’s name
  4. Subtitle: The secondary text under the name
  5. Icon URL: The URL to the app’s icon
  6. Average User Rating: Rounded to nearest .5, requires at least 5 ratings
  7. User Rating Count: Number of ratings internationally; null means fewer than 5
  8. Price: Price in USD
  9. In-app Purchases: Prices of available in-app purchases
  10. Description: The app’s description
  11. Developer: The app developer
  12. Age Rating: Either 4+, 9+, 12+ or 17+
  13. Languages: ISO2A language codes
  14. Size: Size of the app in bytes
  15. Primary Genre: The main genre
  16. Genres: Genres of the app
  17. Original Release Date: When the app was released
  18. Current Version Release Date: When the app was last updated

The first step in the analysis involves identifying anomalies and preparing the data to ensure its quality and suitability for modelling.

# Load the original dataset
original_data <- read_csv("appstore_games.csv")

# Summary of detected anomalies in the dataset
anomalies <- xray::anomalies(original_data)

knitr::kable(
  anomalies$variables, 
  caption = "Summary of Detected Anomalies in the Original Dataset")
Table 1: Summary of Detected Anomalies in the Original Dataset
| Variable | q | qNA | pNA | qZero | pZero | qBlank | pBlank | qInf | pInf | qDistinct | type | anomalous_percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Price | 17007 | 24 | 0.14% | 14212 | 83.57% | 0 | - | 0 | - | 25 | Numeric | 83.71% |
| Subtitle | 17007 | 11746 | 69.07% | 0 | - | 0 | - | 0 | - | 5011 | Character | 69.07% |
| Average User Rating | 17007 | 9446 | 55.54% | 0 | - | 0 | - | 0 | - | 10 | Numeric | 55.54% |
| User Rating Count | 17007 | 9446 | 55.54% | 0 | - | 0 | - | 0 | - | 1793 | Numeric | 55.54% |
| In-app Purchases | 17007 | 9324 | 54.82% | 14 | 0.08% | 0 | - | 0 | - | 3804 | Character | 54.91% |
| Languages | 17007 | 60 | 0.35% | 0 | - | 0 | - | 0 | - | 991 | Character | 0.35% |
| Size | 17007 | 1 | 0.01% | 0 | - | 0 | - | 0 | - | 15795 | Numeric | 0.01% |
| Name | 17007 | 0 | - | 1 | 0.01% | 0 | - | 0 | - | 16847 | Character | 0.01% |
| Age Rating | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 4 | Character | - |
| Primary Genre | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 21 | Character | - |
| Genres | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 1004 | Character | - |
| Current Version Release Date | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 2512 | Character | - |
| Original Release Date | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 3084 | Character | - |
| Developer | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 8693 | Character | - |
| Description | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 16473 | Character | - |
| URL | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 16847 | Character | - |
| ID | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 16847 | Numeric | - |
| Icon URL | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 16847 | Character | - |

Anomalies, such as duplicated observations and missing values, should be addressed to maintain the integrity and reliability of the dataset. Observations with missing values for Average User Rating, the target variable in this analysis, will be removed, as null values represent incomplete information that cannot meaningfully contribute to the machine learning process. Similarly, features like URL and ID, which are unique identifiers with no predictive value, will be excluded to reduce noise and prevent unnecessary model complexity.

# Create a clean dataset from the original data
games_data <- original_data |> 
    distinct() |>                            # Remove duplicate rows based on all columns
    filter(!is.na(`Average User Rating`)) |> # Remove rows with null values in the 'Average User Rating' column
    select(-URL, -ID, -`Icon URL`)           # Drop unnecessary columns: 'URL', 'ID', and 'Icon URL'

Dataset Splitting

To avoid data leakage and ensure a robust evaluation, the dataset is split into training (70%), validation (15%), and test (15%) subsets using stratified sampling based on the target variable. Stratification preserves the Average User Rating distribution across subsets, ensuring representativeness. The training set serves for data transformations, preprocessing, and exploratory data analysis (EDA), while the validation set guides hyperparameter tuning. The test set is held out for final evaluation, ensuring an unbiased assessment of the model on unseen data.

# Load the rsample library for data splitting
library(rsample)

# Set seed for reproducibility
set.seed(123)

# 1. Split the data into training, validation, and test sets

# Initial split: 70% training, 30% for validation + test, stratified by the target variable
split <- initial_split(games_data, prop = 0.7, strata = `Average User Rating`)

# Extract the training dataset
games_train <- training(split)

# Split the remaining 30% into validation and test sets (50/50), stratified by the target variable
valid_test_split <- initial_split(testing(split), prop = 0.5, strata = `Average User Rating`)

# Extract validation and test datasets
games_validation <- training(valid_test_split)
games_test <- testing(valid_test_split)

# 2. Separate features and labels for each dataset

# Training set: features and labels
train_features_unprocessed <- games_train |> select(-`Average User Rating`)
train_labels <- games_train$`Average User Rating`

# Validation set: features and labels
validation_features_unprocessed <- games_validation |> select(-`Average User Rating`)
validation_labels <- games_validation$`Average User Rating`

# Test set: features and labels
test_features_unprocessed <- games_test |> select(-`Average User Rating`)
test_labels <- games_test$`Average User Rating`

Although the mean target variable is consistent across subsets, slight variations in standard deviation reflect differences in the distribution of outliers.

# Calculate summary statistics for the datasets
summary_table <- tibble(
  Dataset = c("Training", "Validation", "Test"),
  Rows = c(nrow(games_train), nrow(games_validation), nrow(games_test)), 
  Columns = c(ncol(games_train), ncol(games_validation), ncol(games_test)), 
  `Target Mean` = c(
    mean(games_train$`Average User Rating`), 
    mean(games_validation$`Average User Rating`), 
    mean(games_test$`Average User Rating`)
  ),
  `Target StdDev` = c(                                     
    sd(games_train$`Average User Rating`), 
    sd(games_validation$`Average User Rating`), 
    sd(games_test$`Average User Rating`)
  )
)

# Print the summary table with a caption
knitr::kable(
  summary_table, 
  caption = "Summary of Training, Validation, and Test Sets") |> 
  kable_styling(
    position = "center",
    full_width = FALSE)
Table 2: Summary of Training, Validation, and Test Sets

| Dataset | Rows | Columns | Target Mean | Target StdDev |
|---|---|---|---|---|
| Training | 5240 | 15 | 4.062977 | 0.7491444 |
| Validation | 1124 | 15 | 4.058274 | 0.7495520 |
| Test | 1124 | 15 | 4.061833 | 0.7584212 |

As Figure 1 shows, the target distributions are closely aligned across splits, making them suitable for reliable training and evaluation.

# Combine data for ggplot
plot_split_data <- data.frame(
    value = c(
        games_train$`Average User Rating`, 
        games_validation$`Average User Rating`, 
        games_test$`Average User Rating`
        ), # Combine target variable values from all splits
  split = factor(rep(
    c("Training Set", "Validation Set", "Test Set"),
    times = c(nrow(games_train), nrow(games_validation), nrow(games_test))
  )) # Label each value with its originating split
) 

# Create an interactive density plot using ggplot2 and ggplotly
ggplotly(
  ggplot(plot_split_data, aes(x = value, color = split, fill = split)) + 
    geom_density(alpha = 0.5) +                               # Density plot with transparency
    labs(
      title = "Distributions Across Splits",                  
      x = "Average User Rating",                              
      y = "Density"                                           
    ) +
    theme_tufte(base_family = "Aeonik") +                 
    theme(
      plot.title = element_text(hjust = 0.5),                 
      legend.position = "top"                                 
    ) +
    scale_fill_brewer(palette = "Blues", direction = -1) +    # Custom fill colours from Brewer palette
    scale_color_manual(values = c("darkblue", "blue", "royalblue"))) |> 
  layout(
    annotations = list(
      text = "Hover over the plot for details",               # Add a subtitle
      x = 0.5,                                                # Center the subtitle
      y = 1.05,                                               # Position above the plot
      xref = "paper",
      yref = "paper",
      showarrow = FALSE,                                      # Disable the arrow for annotation
      font = list(size = 12, family = "Aeonik")            # Set font style for the annotation
    )
  )

Figure 1: Distributions of the training, validation, and test sets.

Feature Engineering and Transformations

Feature engineering is a critical factor in machine learning models. As emphasised by Kaggle champion Xavier Conort, while algorithms are often standard, the key to success lies in the thoughtful creation and selection of features. By carefully designing features, model accuracy and robustness can be significantly improved, making feature engineering a crucial differentiator between high-performing and average models (Conort, 2013).

Feature engineering was performed to convert non-numeric attributes into numerical values, as required by XGBoost. This involved applying transformations with the recipes package, such as:

  1. Convert Size to megabytes.
  2. Transform Age Rating to an integer by removing “+” symbols.
  3. Replace Languages with the number of supported languages, assigning a minimum value of 1 for missing entries.
  4. Encode Subtitle as 1 for presence, 0 otherwise.
  5. Calculate market presence as the difference between Current Version Release Date and Original Release Date.
  6. Convert Original Release Date and Current Version Release Date to years.
  7. Calculate a sentiment analysis score for the Description.
  8. Create a binary attribute for apps requiring payment to download (1 if Price > 0, otherwise 0).
  9. Convert In-app Purchases into a binary attribute (1 if > 0, otherwise 0), handling missing values.
  10. Create a binary feature for apps requiring payment (1 if Price or In-app Purchase > 0, otherwise 0).
  11. Add an attribute indicating awards/nominations based on the Description.
  12. One-hot encode Genres; Primary Genre is removed later with the other non-numeric fields to avoid duplication.
  13. Remove non-numeric variables.

# Load necessary libraries
library(recipes)        # For creating and managing data preprocessing workflows
library(textrecipes)    # For text-specific preprocessing steps
library(stringr)        # For string manipulation
library(lubridate)      # For working with date and time data
library(syuzhet)        # For sentiment analysis

# Create a recipe
games_recipe <- recipe(~ ., data = train_features_unprocessed) |> 
  
  # Step 1: Convert 'Size' to megabytes
  step_mutate(Size = Size / 1024 / 1024) |> 
  
  # Step 2: Remove "+" from Age Rating and transform it to an integer
  step_mutate(`Age Rating` = as.integer(str_remove(`Age Rating`, "\\+"))) |> 
  
  # Step 3: Transform Languages to the number of supported languages (number of commas + 1, with a minimum of 1)
  step_mutate(Languages = as.integer(ifelse(is.na(Languages), 1, str_count(Languages, ",") + 1))) |> 
  
  # Step 4: Convert Subtitle into a binary numeric value (1 for presence, 0 for absence)
  step_mutate(Subtitle = as.integer(ifelse(is.na(Subtitle), 0, 1))) |> 
  
  # Step 5: Calculate market presence (difference between Current and Original Release Dates in years)
  step_mutate(
    `Original Release Date` = dmy(`Original Release Date`),
    `Current Version Release Date` = dmy(`Current Version Release Date`),
    `Market Presence` = round(as.numeric(`Current Version Release Date` - `Original Release Date`) / 365.25, 1)) |>  # Divide by 365.25 to account for leap years
  
  # Step 6: Convert Original and Current Release Dates to years as integers
  step_mutate(
    `Original Release Date` = as.integer(year(`Original Release Date`)),
    `Current Version Release Date` = as.integer(year(`Current Version Release Date`))) |> 
  
  # Step 7: Calculate sentiment analysis score for the Description column
  step_mutate(`Description Sentiment` = get_sentiment(`Description`, method = "syuzhet")) |> 
  
  # Step 8: Create a binary attribute for Pay to Download (1 if Price > 0, 0 otherwise)
  step_mutate(`Pay to Download` = as.integer(ifelse(Price > 0, 1, 0))) |> 
  
  # Step 9: Handle missing In-app Purchases values (set missing to 0) and create a binary attribute
  step_mutate(`In-app Purchases` = ifelse(is.na(`In-app Purchases`), "0", `In-app Purchases`)) |> 
  step_mutate(`In-app Purchases` = as.integer(ifelse(`In-app Purchases` == "0", 0, 1))) |> 
  
  # Step 10: Create a binary attribute for paid apps (1 if Price > 0 or In-app Purchases > 0, 0 otherwise)
  step_mutate(`Paid App` = as.integer(ifelse(Price > 0 | `In-app Purchases` == 1, 1, 0))) |> 
  
  # Step 11: Create a binary attribute indicating if the app has awards/nominations in its Description
  step_mutate(`Award or Nomination` = as.integer(ifelse(
    str_detect(`Description`, regex("award|nominat|prize|recognition", ignore_case = TRUE)),
    1, 0))) |>
  
  # Step 12: Handle Genres column by creating binary columns for each genre
  step_mutate(Genres = gsub("(\\w+)\\s(\\w+)", "\\1_\\2", Genres)) |>   # Replace spaces in multi-word genres with underscores
  step_tokenize(Genres) |>         # Tokenise the Genres column
  step_tf(Genres) |>               # Create binary columns for each genre
  step_rename_at(all_predictors(), fn = ~ gsub("tf_", "", ., fixed = TRUE)) |>  # Remove "tf_" prefix from column names
  step_rename_at(all_predictors(), fn = ~ gsub("_", " ", ., fixed = TRUE)) |>  # Replace underscores back with spaces
  
  # Step 13: Select only numeric variables (remove non-numeric fields)
  step_select(where(is.numeric))

# Prepare the recipe
games_recipe <- prep(games_recipe, training = train_features_unprocessed, retain = TRUE)

# Apply the transformations to the training dataset
train_features <- bake(games_recipe, new_data = train_features_unprocessed)

# Combine preprocessed features with labels for Exploratory Data Analysis (EDA)
train_set <- bind_cols(train_features, `Average User Rating` = train_labels)

# Apply the transformations to the validation dataset
validation_features <- bake(games_recipe, new_data = validation_features_unprocessed)

# Apply the transformations to the test dataset
test_features <- bake(games_recipe, new_data = test_features_unprocessed)

These transformations expanded the training dataset from 15 to 60 variables, capturing a more comprehensive data representation.

tibble [5,240 × 60] (S3: tbl_df/tbl/data.frame)
 $ Subtitle                    : int [1:5240] 0 0 0 1 0 0 0 0 0 0 ...
 $ User Rating Count           : num [1:5240] 284 8376 28 47 35 ...
 $ Price                       : num [1:5240] 1.99 0 2.99 0 0 0 0 0 0 0.99 ...
 $ In-app Purchases            : int [1:5240] 0 0 0 1 0 0 0 0 0 0 ...
 $ Age Rating                  : int [1:5240] 4 4 4 4 4 4 4 4 4 4 ...
 $ Languages                   : int [1:5240] 1 1 15 1 1 1 1 1 1 1 ...
 $ Size                        : num [1:5240] 11.758 0.644 33.082 46.418 6.035 ...
 $ Original Release Date       : int [1:5240] 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ Current Version Release Date: int [1:5240] 2018 2017 2018 2019 2013 2017 2008 2008 2009 2019 ...
 $ Market Presence             : num [1:5240] 9.8 9.2 10 10.7 5.3 9.3 0 0.3 0.9 10.7 ...
 $ Description Sentiment       : num [1:5240] 8.4 3.8 6.95 12.85 1.8 ...
 $ Pay to Download             : int [1:5240] 1 0 1 0 0 0 0 0 0 1 ...
 $ Paid App                    : int [1:5240] 1 0 1 1 0 0 0 0 0 1 ...
 $ Award or Nomination         : int [1:5240] 0 0 1 0 0 0 0 0 0 1 ...
 $ Genres action               : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres adventure            : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres board                : int [1:5240] 1 1 1 0 0 1 0 0 0 0 ...
 $ Genres books                : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres business             : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres card                 : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres casino               : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres casual               : int [1:5240] 0 0 0 0 0 0 1 0 0 0 ...
 $ Genres drink                : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres education            : int [1:5240] 0 0 1 0 0 0 0 0 0 0 ...
 $ Genres emoji                : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres entertainment        : int [1:5240] 0 0 0 1 1 1 0 0 1 0 ...
 $ Genres expressions          : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres family               : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres finance              : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres fitness              : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres food                 : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres games                : int [1:5240] 1 1 1 1 1 1 1 1 1 1 ...
 $ Genres gaming               : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres health               : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres kids                 : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres lifestyle            : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres magazines            : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres medical              : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres music                : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres navigation           : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres news                 : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres newspapers           : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres photo                : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres productivity         : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres puzzle               : int [1:5240] 0 0 0 1 1 0 0 1 0 1 ...
 $ Genres racing               : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres reference            : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres role playing         : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres shopping             : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres simulation           : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres social networking    : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres sports               : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres stickers             : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres strategy             : int [1:5240] 1 1 1 1 1 1 1 1 1 1 ...
 $ Genres travel               : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres trivia               : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres utilities            : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres video                : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Genres word                 : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
 $ Average User Rating         : num [1:5240] 3.5 3 3.5 3 2.5 2.5 2.5 3.5 3 3.5 ...

The preprocessing and transformations were first applied to the training data and subsequently mirrored on the validation and test sets to maintain consistency and avoid data leakage. By carefully designing features and applying rigorous preprocessing, the dataset was effectively prepared for modelling with XGBoost, providing a solid foundation for exploring patterns and making accurate predictions.
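This leakage-safe pattern — learning parameters with prep() on training data only, then reusing them via bake() on other splits — can be illustrated with a minimal, self-contained sketch (toy data frames stand in for the real splits):

```r
library(recipes)

# Toy stand-ins for the real training and validation splits
train_toy <- data.frame(x = c(1, 2, 3, 4))
new_toy   <- data.frame(x = c(10, 20))

# Normalisation parameters (mean, sd) are estimated from training data only
rec <- recipe(~ x, data = train_toy) |>
  step_normalize(x) |>
  prep(training = train_toy)

# New data is scaled with the TRAINING mean (2.5) and sd, not its own
baked_new <- bake(rec, new_data = new_toy)
```

Applying the training-derived statistics to the validation and test sets in exactly this way is what prevents information from those sets leaking into the model.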

Exploratory Analysis

We can now examine the factors influencing Average User Rating to provide a comprehensive overview of patterns, relationships, and trends within the data. By examining key variables and their interactions, this analysis seeks to uncover meaningful insights into user satisfaction, laying the groundwork for robust predictive modelling.

While numerous variables could potentially explain the variation in Average User Rating, this analysis is focused on the main predictors. This decision stems from the inherent complexity of relationships between variables, making it challenging to identify all significant contributors manually. Advanced machine learning models like XGBoost are particularly valuable in this context, as they are designed to automatically detect and prioritise the most important variables, even when the relationships are non-linear or subtle.

By leveraging XGBoost, the analysis ensures that important factors are not overlooked due to the limitations of traditional methods or subjective bias. This allows us to streamline the analysis by concentrating on the most influential variables while the algorithm captures complex interactions and dependencies. This methodology enhances the insights’ accuracy and makes the analysis more focused and actionable.

Distribution of Average User Ratings

Understanding the distribution of the target variable (Average User Rating) is crucial in machine learning, as it directly influences the performance and reliability of predictive models. Figure 2 illustrates this distribution, segmented by whether the app is free or paid.

# Create the bar plot
ggplot(train_set, aes(x = `Average User Rating`, fill = factor(`Paid App`))) +
  scale_x_continuous(breaks = seq(0, 5, by = 0.5)) + 
  geom_bar(alpha = 0.8, position = "stack", color = "black", width = 0.3) +
  labs(
    title = "Distribution of Average User Rating",
    x = "Average User Rating",
    y = "Count",
    fill = "Free App?"
  ) +
  scale_fill_manual(values = c("lightblue", "skyblue"), labels = c("Yes", "No")) +
  theme_tufte(base_family = "Aeonik") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top"
    )

Figure 2: Distribution of the target variable (Average User Rating).

The chart reveals a left-skewed distribution, with a pronounced peak in the 4.0 to 4.5 range and relatively few extremely low ratings. Paid apps are prevalent across all rating ranges, representing most apps with both high and low ratings. In contrast, free apps, though fewer, exhibit a more even distribution across rating categories and appear to have slightly less variability in average ratings compared to paid apps (Figure 3).

# Load the ggridges package for creating ridgeline plots
library(ggridges)

# Create the ridgeline plot to visualise the distribution of 'Average User Rating'
ggplot(train_set, aes(x = `Average User Rating`, y = factor(`Paid App`, labels = c("Free App", "Paid App")), fill = factor(`Paid App`))) +
  geom_density_ridges(alpha = 0.8, scale = 1) +
  scale_x_continuous(breaks = seq(0, 5, by = 0.5)) + 
  scale_fill_manual(values = c("lightblue", "skyblue")) +
  labs(
    title = "Distribution of Average User Rating",
    x = "Average User Rating",
    y = "",
    fill = "Free App?"
  ) +
  theme_tufte(base_family = "Aeonik") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "none"
    )

Figure 3: The ridgeline plot of the Average User Rating distribution split by the app type.

This concentration of ratings in the higher range has significant implications for machine learning modelling. The skewed distribution introduces a form of target imbalance, with most ratings clustered within a narrow range (e.g., 4.0–4.5), potentially making it difficult for models to accurately predict lower or extreme ratings, which are underrepresented in the dataset. Furthermore, the dominance of paid apps suggests that pricing may interact with other features, adding complexity to determining variable importance in the model.
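One common mitigation — offered here only as a hedged sketch, not as part of this report's pipeline — is to weight observations inversely to the frequency of their rating value, so that the underrepresented low ratings contribute more to the training loss:

```r
# Hedged sketch (not used in this report): inverse-frequency observation
# weights to counteract the imbalance described above. `labels` stands in
# for train_labels; toy ratings are used for illustration.
labels <- c(4.5, 4.5, 4.5, 4.0, 4.0, 1.0)
freq   <- table(labels)
w      <- 1 / as.numeric(freq[as.character(labels)])
w      <- w / mean(w)   # normalise to a mean weight of 1
# w could then be supplied via xgboost::xgb.DMatrix(data, weight = w)
```

Here the single 1.0-rated observation receives the largest weight, while the common 4.5 ratings are down-weighted.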

Relationships Between Predictors and Average User Ratings

Figure 4 provides a comprehensive overview of the relationships between several potentially influential predictors — User Rating Count, Languages, Size, Age Rating, and Original Release Date — and the Average User Rating.

library(GGally)

# Select top numeric predictors and target variable
ggpairs_data <- train_set |>
  select(`Average User Rating`, `User Rating Count`, `Languages`, `Size`, `Age Rating`, `Original Release Date`)

# Pair plot
ggpairs(
  ggpairs_data,
  aes(fill = `Average User Rating`),
  upper = list(continuous = wrap("cor", size = 2.5)),
  lower = list(continuous = wrap("points", alpha = 0.6))
) +
  theme_tufte(base_family = "Aeonik") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top"
    ) +
  labs(title = "Key Predictors and Average User Rating")

Figure 4: The scatterplot matrix of several potential predictors.

The analysis reveals that the number of supported Languages has a weak but statistically significant positive correlation with Average User Rating. Similarly, the Original Release Date shows a moderate positive correlation, indicating that newer apps receive higher ratings, possibly reflecting alignment with current user preferences or trends. Additionally, a weak to moderate positive correlation is observed between Age Rating and Average User Rating, suggesting that as Age Rating increases (indicating content suitable for older audiences), Average User Rating tends to rise slightly. While these relationships are statistically significant, their correlation values indicate that the associations are not particularly strong, suggesting the multifactorial nature of user ratings, with no single variable strongly driving the outcome. This highlights the need for a comprehensive modelling approach capable of capturing the complex interactions between multiple predictors, as these interactions likely influence user ratings in nuanced and non-linear ways.
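The strength and significance of such pairwise associations can be checked formally with cor.test(). A hedged sketch, in which toy vectors stand in for the real train_set columns (e.g., Languages and Average User Rating):

```r
# Hedged sketch: a formal Pearson correlation test, as could be applied to
# Languages vs. Average User Rating. Toy vectors stand in for train_set.
languages <- c(1, 1, 2, 3, 5, 8, 12, 15, 20, 25)
ratings   <- c(3.5, 4.0, 3.5, 4.0, 4.0, 4.5, 4.0, 4.5, 4.5, 5.0)
ct <- cor.test(languages, ratings)   # Pearson by default
ct$estimate   # correlation coefficient
ct$p.value    # significance of the association
```

A small but significant coefficient, as reported above, indicates a real yet weak association — exactly the situation where a multivariate model adds value over pairwise analysis.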

Sentiment and Satisfaction: App Descriptions and User Ratings

Figure 5 visualises the relationship between Description Sentiment and Average User Rating, incorporating additional dimensions such as User Rating Count (indicated by bubble size) and Age Rating (represented by colour intensity). This chart provides insights into how the tone of app descriptions correlates with user satisfaction and other factors.

# Bubble plot of Description Sentiment vs. Average User Rating with bubble size as User Rating Count
ggplotly(ggplot(train_set, aes(x = `Description Sentiment`, y = `Average User Rating`, size = `User Rating Count`, color = `Age Rating`)) +
  geom_point(alpha = 0.6) +
  scale_color_gradient(low = "lightblue", high = "darkblue", name = "Age Rating") +
  scale_size_continuous(range = c(1, 10), name = "User Rating Count", labels = scales::comma) +  # Format legend numbers
  scale_y_continuous(breaks = seq(0, 5, 0.5)) +  
  labs(title = "Description Sentiment vs. Average User Rating",
       x = "Description Sentiment",
       y = "Average User Rating") +
  theme_tufte(base_family = "Aeonik") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top"
    )) |> 
  layout(
  annotations = list(
    text = "Hover over the plot for details",
    x = 0.5,                                       # Center the subtitle
    y = 1.05,                                      # Position above the chart
    xref = "paper",
    yref = "paper",
    showarrow = FALSE,
    font = list(size = 12, family = "Aeonik")
  )
)

Figure 5: The relationship between Description Sentiment and Average User Rating. User Rating Count is indicated by bubble size and Age Rating is represented by colour intensity.

The data reveals a weak association between Description Sentiment and Average User Rating, with apps scoring neutral to slightly positive sentiments (sentiment scores around 0 to 10) dominating the rating range. Extreme sentiment scores, whether highly positive or negative, are less common and do not exhibit clear trends of higher or lower ratings. This suggests that while description sentiment may influence user perceptions to some extent, it is not a strong standalone predictor of user ratings.

Bubble sizes indicate substantial variation in user engagement across apps. Larger bubbles, representing apps with higher User Rating Counts, are concentrated around Average User Ratings of 4.0 to 4.5, indicating that apps with higher ratings generally attract more feedback. Notably, one obvious outlier with an extremely high User Rating Count sits near an Average User Rating of 4.5. Conversely, smaller bubbles, corresponding to apps with fewer ratings, show greater variability in average ratings, which may result from limited feedback and higher volatility.

The colour intensity, representing Age Rating, does not exhibit a strong relationship with either Description Sentiment or Average User Rating. Apps across all age groups are spread throughout the sentiment spectrum, indicating that content appropriateness (as measured by age rating) does not significantly correlate with sentiment or satisfaction.

The plot highlights that while Description Sentiment plays a minor role, user ratings are influenced by a combination of factors. Apps with neutral to moderately positive sentiment and high user engagement tend to achieve higher ratings, but sentiment alone does not provide a strong predictive signal for user satisfaction. This underscores the multifactorial nature of user ratings, requiring a broader consideration of variables to fully understand the drivers of user satisfaction.

Exploring User Preferences: Average Ratings and Engagement by Genre

Figure 6 illustrates the relationship between Average User Rating and the average User Rating Count across various genres, providing a clear visualisation of top-performing categories in terms of both ratings and user engagement. This analysis offers valuable insights into the genres that resonate most with users, serving as a foundation for selecting relevant genres in predictive models.

library(tidyr)   # for pivot_longer()

# Create a data subset with avg. user rating per genre
genre_columns <- grep("^Genres", colnames(train_set), value = TRUE)
heatmap_data <- train_set |>
  select(all_of(genre_columns), `Average User Rating`, `User Rating Count`) |>
  pivot_longer(cols = all_of(genre_columns), names_to = "Genre", values_to = "Genre_Present") |>
  filter(Genre_Present == 1) |>    # Filter only those rows where the app is present in the genre
  mutate(Genre = sub("Genres ", "", Genre)) |>
  group_by(Genre) |>
  summarize(
    `Average User Rating` = mean(`Average User Rating`, na.rm = TRUE),
    `Average User Rating Count` = mean(`User Rating Count`, na.rm = TRUE)
  ) |>
  ungroup()

# Plot a polar bar (radar-style) chart
ggplot(heatmap_data, aes(x = reorder(Genre, -`Average User Rating`), y = `Average User Rating`, fill = `Average User Rating Count`)) +
  geom_col() +
  coord_polar(theta = "x") +
  geom_text(aes(label = round(`Average User Rating`, 1)), 
            position = position_stack(vjust = 0.95), 
            size = 2, 
            color = "white") +                              # Add labels for Average Rating
  labs(
    title = "Average User Rating by Genre",
    x = NULL,
    y = NULL,
    fill = "Average User Rating Count"                      # Legend title
  ) +
  scale_fill_gradient(
    low = "lightblue", high = "darkblue",
    labels = scales::comma                                   # Format numbers with commas
  ) +
  theme_minimal(base_family = "Aeonik") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 12),
    legend.position = "right",                
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 6),
    axis.text.x = element_text(angle = 45, hjust = 2, size = 6),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid.major = element_line(color = "gray90", linewidth = 0.3),
    panel.grid.major.y = element_blank()
  )

Figure 6: The relationship between Average User Rating and the average User Rating Count across genres.

High-performing genres, such as Newspapers and Magazines, achieve perfect average ratings (5.0) but have significantly fewer user ratings, indicating that limited feedback may inflate these averages. Conversely, genres like Kids and Reference exhibit very low average ratings and small numbers of user reviews, reflecting less user engagement. In contrast, mid-range genres like Finance and Action display broader user engagement, as indicated by higher rating counts, and maintain more stable average ratings around 4.1.

This pattern highlights an important trend: genres with extreme ratings (either very high or very low) are often associated with fewer reviews, leading to less stable averages influenced by initial or limited feedback. In contrast, genres with higher engagement tend to have more stable, moderate ratings, representing a wider range of user experiences.

These findings underscore the need to consider both Average User Rating and User Rating Count in predictive analysis and strategic decisions. Combining these metrics provides a more comprehensive understanding of user preferences and satisfaction, enabling more accurate and actionable insights.
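The small-sample instability described above can be sketched with a short base-R simulation (an illustrative aside, not part of the notebook's pipeline; the rating distribution below is invented): when apps draw ratings from the same underlying distribution, the observed average is far more volatile for apps with only a handful of ratings.

```r
# Sketch: averages built on few ratings are far more volatile than averages
# built on many, even when underlying app quality is identical
set.seed(123)

simulate_avg_spread <- function(n_ratings, n_apps = 1000) {
  # Each simulated app draws n_ratings ratings from the same 1-5 distribution
  avgs <- replicate(n_apps, mean(sample(1:5, n_ratings, replace = TRUE,
                                        prob = c(0.05, 0.05, 0.15, 0.45, 0.30))))
  sd(avgs)  # spread of the observed average rating across apps
}

simulate_avg_spread(5)    # wide spread: genres with little feedback
simulate_avg_spread(500)  # narrow spread: high-engagement genres
```

The spread shrinks roughly with the square root of the number of ratings, which is consistent with extreme genre averages appearing mainly where feedback is scarce.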

Predictive Modelling

XGBoost is an excellent choice for predicting Average User Rating because it handles structured data with exceptional efficiency and accuracy. Its gradient boosting framework is particularly effective at capturing complex, non-linear relationships between features and the target variable. Additionally, XGBoost incorporates advanced regularisation techniques to mitigate overfitting, making it highly robust in handling data with variability and noise — common challenges in predicting user satisfaction metrics.

The algorithm’s flexibility in hyperparameter tuning allows for precise optimisation to achieve high performance, further solidifying its reputation as a leading solution for structured data problems. These strengths align perfectly with this analysis’s goals, ensuring accuracy and reliability in the predictive modelling.

Moreover, XGBoost is remarkably insensitive to the normalisation technique, with similar accuracy results, regardless of whether normalising data or not (Cabello-Solorzano et al., 2023). As a result, data normalisation or scaling steps were intentionally omitted in this project, streamlining the workflow while maintaining the model’s effectiveness.
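This insensitivity follows from how tree-based learners split: a split threshold partitions observations using only the ordering of feature values, so any monotonic rescaling (such as min-max normalisation) yields the same partition. A minimal base-R sketch with a synthetic feature (illustrative only, not the project's pipeline) demonstrates this for a single decision stump:

```r
# Sketch: a decision stump's best split partitions the rows identically
# before and after min-max scaling, because splits depend only on ordering
set.seed(42)
x <- runif(200, min = 0, max = 500)       # synthetic feature (e.g. size in MB)
y <- 2 + 0.01 * x + rnorm(200, sd = 0.3)  # synthetic target with a trend

best_split_partition <- function(x, y) {
  cuts <- sort(unique(x))
  sse <- sapply(cuts, function(cut_point) {
    left <- y[x <= cut_point]; right <- y[x > cut_point]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  x <= cuts[which.min(sse)]               # row partition at the best cut
}

x_scaled <- (x - min(x)) / (max(x) - min(x))  # min-max normalisation
identical(best_split_partition(x, y), best_split_partition(x_scaled, y))  # TRUE
```

The same argument applies to every split in every boosted tree, which is why omitting the scaling step costs XGBoost nothing in accuracy.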

Baseline XGBoost Model

We are now prepared to build the model. We first train a baseline model with the xgboost() function using its default parameters. XGBoost requires the data to be converted into a matrix format, as it does not process tibbles or data frames directly.

library(xgboost)

# Convert training, validation, and test sets into a matrix suitable for XGBoost processing
train_dmatrix <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)

validation_dmatrix <- xgb.DMatrix(data = as.matrix(validation_features))

test_dmatrix <- xgb.DMatrix(data = as.matrix(test_features))
library(tictoc)

# Start timer
tic("Training the model")

# Train an XGBoost model using xgboost() function
xgb_model_base <- xgboost(
  data = train_dmatrix,           # Training data in DMatrix format
  nrounds = 200,                  # Number of boosting iterations
  objective = "reg:squarederror", # Objective function for regression tasks
  eval_metric = "rmse",           # Evaluation metric: RMSE to evaluate model performance
  verbose = 0                     # Suppress output during training
)

# Stop timer and display elapsed time
toc()
Training the model: 1.838 sec elapsed
# Display the parameters used in the model
xgb_model_base$params
$objective
[1] "reg:squarederror"

$eval_metric
[1] "rmse"

$validate_parameters
[1] TRUE

Once the basic XGBoost model is trained, we generate a SHAP summary plot to analyse the importance of features and their impact on predictions. Figure 7 provides a detailed view of how each feature influences model predictions. Each point represents an individual prediction, with the x-axis indicating the SHAP value, quantifying a feature’s contribution to the prediction. The colour gradient illustrates feature values, with one end representing higher values and the other lower values. Features are ranked by importance, with the most influential at the top. This visualisation reveals the key drivers of predictions and whether their effects are positive or negative, offering valuable insights into feature interactions and model behaviour.

# Generate SHAP summary plot
xgb.ggplot.shap.summary(
  model = xgb_model_base,
  data = as.matrix(train_features)) +
  ggtitle("SHAP Feature Importance of the XGBoost Model") +
  theme_tufte(base_family = "Aeonik") +
  theme(
    plot.title = element_text(hjust = 0.5)) +
  labs(
    x = "", 
    y = "SHAP value (impact on model output)",
    color = "Feature Value"
  ) 

Figure 7: The SHAP summary plot of the baseline XGBoost model.

From the SHAP plot, we observe that Size, User Rating Count, and Description Sentiment are the most influential features affecting Average User Rating. Additionally, features such as recent Original Release Dates and Current Version Dates are positively associated with higher target values, whereas high Market Presence negatively impacts the target variable. These interactions provide insights into how specific features shape predictions and help explain the model’s performance. Notably, four of the most influential features were derived through feature engineering, including Description Sentiment, number of supported Languages, Market Presence-related metrics, and simulation genre. This underscores the critical role of feature engineering in enhancing the model’s predictive power.

The evaluation metrics for the basic XGBoost model indicate reasonable performance and highlight areas for improvement.

# Load necessary library for evaluation metrics
library(Metrics) 

# Define the function to calculate evaluation metrics
evaluation_metrics <- function(actual_labels, predicted_labels, train_features) {
  # Residual sum of squares (RSS)
  rss <- sum((actual_labels - predicted_labels)^2)
  
  # Number of observations and predictors
  n <- length(predicted_labels)       
  p <- ncol(train_features) 
  
  # AIC and BIC under a Gaussian likelihood (simplified, RSS-based form)
  aic <- n * log(rss / n) + 2 * p
  bic <- n * log(rss / n) + log(n) * p
  
  # Calculate metrics
  metrics <- data.frame(
    Metric = c("RMSE", "MAE", "MAPE", "R-squared", "AIC", "BIC"),
    Value = c(
      sqrt(mean((actual_labels - predicted_labels)^2)),                    # RMSE
      mean(abs(actual_labels - predicted_labels)),                         # MAE
      mean(abs((actual_labels - predicted_labels) / actual_labels)) * 100, # MAPE
      cor(actual_labels, predicted_labels)^2,                              # R-squared
      aic,                                                                 # AIC
      bic                                                                  # BIC
    )
  )
  
  # Return the data frame of calculated metrics
  return(metrics) 
}
# Predicted validation values using the trained model
validation_predictions_base <- predict(xgb_model_base, newdata = validation_dmatrix)

# Calculate evaluation metrics
evaluation_metrics(validation_labels, validation_predictions_base, train_features) |>  
    # Convert to a kable table
    kable(
      col.names = c("Evaluation Metric", "Value"), 
      caption = "Base Model Performance Metrics", 
      digits = 4
    ) |> 
      kable_styling(
      position = "center",
    full_width = FALSE                            
  ) 
Table 5: Base Model Performance Metrics
Evaluation Metric Value
RMSE 0.6988
MAE 0.5210
MAPE 15.3379
R-squared 0.1756
AIC -687.8212
BIC -391.3669

Metrics like RMSE and MAE reveal moderate prediction errors, suggesting the model captures overall trends but struggles with precise value predictions. The MAPE of 15.34% indicates that the model’s predictions are, on average, 15% off the actual value. While the negative AIC and BIC values imply a good balance between model fit and complexity, the low R-squared reflects the challenge of modelling non-linear relationships and high data variability. However, it is important to note that in contexts involving complex data, such as clinical research, even R-squared values above 15% are considered meaningful (Gupta, Stead, and Ganti, 2024).
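To make the formulas behind these metrics concrete, here is a toy base-R calculation on invented vectors, mirroring the RSS-based AIC/BIC form used in evaluation_metrics() above:

```r
# Toy illustration of the metric formulas on invented actual/predicted vectors
actual    <- c(4.0, 3.5, 4.5, 5.0, 3.0)
predicted <- c(4.2, 3.4, 4.4, 4.6, 3.5)

rss <- sum((actual - predicted)^2)   # residual sum of squares
n   <- length(actual)                # number of observations
p   <- 2                             # pretend the model used 2 predictors

rmse <- sqrt(rss / n)
mae  <- mean(abs(actual - predicted))
mape <- mean(abs((actual - predicted) / actual)) * 100
aic  <- n * log(rss / n) + 2 * p       # simplified, RSS-based Gaussian AIC
bic  <- n * log(rss / n) + log(n) * p  # BIC swaps the 2p penalty for log(n) * p

round(c(RMSE = rmse, MAE = mae, MAPE = mape, AIC = aic, BIC = bic), 3)
```

Note that with this RSS-based form, a smaller (more negative) AIC or BIC indicates a better trade-off between fit and the number of predictors.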

Rounding of the target variable to the nearest 0.5 in the original dataset introduces additional challenges in assessing model performance: the model outputs continuous values such as 3.16, reflecting its smooth estimation process, while the observed labels fall only on a 0.5 grid.

# Display the first few predictions from the validation set
cat("First predictions:", head(validation_predictions_base))
First predictions: 4.611531 3.105643 3.374446 3.160298 3.590192 3.501201

Manipulating the predicted variable by rounding can distort the model’s performance metrics and interpretation, leading to misleading results. Rounding reduces the precision of the predictions, which may hide the true accuracy and nuances captured by the model. Instead, using the raw model outputs provides a clearer picture of its performance, allowing for a better evaluation of how well it captures underlying trends in the data.
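The distortion can be demonstrated directly (a base-R sketch with invented values; round(x * 2) / 2 snaps predictions to the same 0.5 grid as the dataset's labels):

```r
# Sketch: rounding predictions to the nearest 0.5 changes the error metrics,
# which is why the raw model outputs are evaluated instead (values invented)
actual    <- c(4.5, 4.0, 3.5, 4.5, 3.0)
predicted <- c(4.61, 3.11, 3.37, 4.26, 3.16)  # continuous, model-style outputs
rounded   <- round(predicted * 2) / 2         # snap to the 0.5 grid

rmse <- function(a, p) sqrt(mean((a - p)^2))
rmse(actual, predicted)  # error of the raw predictions
rmse(actual, rounded)    # a different error once predictions are rounded
```

Depending on which side of a grid point a prediction falls, rounding can either mask genuine errors or exaggerate small ones, so the metric no longer reflects what the model actually learned.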

Comparing actual and predicted densities in Figure 8, we observe discrepancies, particularly in the first quartile. However, this underperformance in the smaller quartile was anticipated during the exploratory data analysis (EDA), as lower ratings are underrepresented in the dataset. The model’s difficulties in predicting this segment align with expectations based on the observed skewness and data distribution patterns.

# Function to plot density (distribution) for actual vs predicted values
plot_density_distribution <- function(actual_values, predicted_values) {
  
  # Create a data frame with actual and predicted values
  data <- data.frame(
    value = c(actual_values, predicted_values),
    type = rep(c("Actual", "Predicted"), c(length(actual_values), length(predicted_values)))
  )
  
  # Plot density curves for both actual and predicted values
  ggplot(data, aes(x = value, fill = type, color = type)) +
    geom_density(alpha = 0.5, linewidth = 1) +  # Create density curves with transparency
    scale_fill_manual(values = c("Actual" = "skyblue", "Predicted" = "lightblue")) + 
    scale_color_manual(values = c("Actual" = "skyblue", "Predicted" = "lightblue")) + 
    labs(title = "Density Distribution of the Target Variable",
         x = "Average User Rating", y = "Density",
         fill = "", color = "") +  # Axis and legend labels
    theme_tufte(base_family = "Aeonik") +
    theme(
      plot.title = element_text(hjust = 0.5),
      legend.position = "top"
    )
}

# Plot density distribution
plot_density_distribution(validation_labels, validation_predictions_base) + 
    ggtitle ("XGBoost Basic Model: Density Distribution of the Target Variable")

Figure 8: Density distributions of the actual values of the target variable from the validation set and the values predicted by the baseline XGBoost model.

The summary statistics confirm that the predicted values differ from the validation set primarily in the lower quartile, highlighting discrepancies in this segment. Additionally, the maximum predicted value exceeds the observed range, indicating an overestimation in some cases.

# Generate summary statistics for the actual target variable
glance(summary(validation_labels)) |> 
    kable(
      caption = "Summary Statistics of the Actual Target Variable", 
      digits = 2
    ) |> 
      kable_styling(
      position = "center",
    full_width = FALSE                            
  ) 
Table 6: Summary Statistics of the Actual Target Variable
minimum q1 median mean q3 maximum
1 3.5 4 4.06 4.5 5
# Generate summary statistics for the predicted target variable
glance(summary(validation_predictions_base)) |> 
    kable(
      caption = "Summary Statistics of the Predicted Target Variable", 
      digits = 2) |> 
      kable_styling(
      position = "center",
      full_width = FALSE) 
Table 7: Summary Statistics of the Predicted Target Variable
minimum q1 median mean q3 maximum
1.97 3.81 4.13 4.05 4.38 5.17

The basic XGBoost model demonstrates solid initial performance, with promising evaluation metrics such as RMSE and MAE that align reasonably well with the variability in the data, as indicated by the standard deviation (0.75). However, further refinements, such as advanced hyperparameter tuning, could enhance its accuracy and improve its ability to predict the target variable more effectively.

Tuning with caret

Tuning the basic XGBoost model with the caret package enhances its performance by systematically optimising hyperparameters such as learning rate, tree depth, and subsampling. These adjustments reduce overfitting, improve predictive accuracy, and better capture the underlying data patterns. The caret package employs techniques like grid search and cross-validation to thoroughly explore the hyperparameter space, enabling a more robust and fine-tuned model. This approach balances model complexity and performance, resulting in more reliable and accurate predictions.

Given the computational intensity of training XGBoost models with caret, particularly when exploring a large hyperparameter grid, the range of parameters was deliberately limited to optimise efficiency. By narrowing the hyperparameter search space, it was still possible to identify a set of parameters that improved model performance while maintaining computational feasibility. This strategy allowed for significant enhancements in predictive accuracy without overwhelming system resources.
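The cost of the search can be sized up front (a quick base-R check mirroring the grid defined in the next chunk): six tuned parameters give 3 × 2 × 2 × 2 × 2 × 2 = 96 combinations, each fitted once per cross-validation fold.

```r
# Number of model fits implied by the hyperparameter grid below
grid_size <- prod(lengths(list(
  eta              = c(0.03, 0.05, 0.1),
  gamma            = c(0, 1),
  max_depth        = c(6, 7),
  min_child_weight = c(2, 3),
  subsample        = c(0.9, 1),
  colsample_bytree = c(0.9, 1)
)))
grid_size       # 96 parameter combinations
grid_size * 3   # 288 model fits under 3-fold cross-validation
```

Doubling the options for even two more parameters would quadruple this count, which is why the search space was deliberately kept narrow.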

library(caret)

# Start timer
tic("Training the model")

# Define the hyperparameter grid to search over
xgb_grid <- expand.grid(
  nrounds = 100,
  eta = c(0.03, 0.05, 0.1),      # Learning rate (lower values reduce overfitting risk)
  gamma = c(0, 1),               # Minimum loss reduction for split
  max_depth = c(6, 7),           # Depth of trees (shallower trees often prevent overfitting)
  min_child_weight = c(2, 3),    # Minimum number of data points in a child node (to control splits)
  subsample = c(0.9, 1),         # Fraction of data used for training in each iteration
  colsample_bytree = c(0.9, 1)   # Fraction of features used for training in each iteration
)

# Set up cross-validation control
train_control <- caret::trainControl(
  method = "cv",                     # k-fold cross-validation
  number = 3,                        # 3 folds (kept small for computational efficiency)
  verboseIter = FALSE,               # Do not display progress
  search = "grid",                   # Grid search method for hyperparameter tuning
  summaryFunction = defaultSummary,  # RMSE will be used by default
  allowParallel = TRUE 
)

# Train the XGBoost model using caret
set.seed(123)  # Set seed for reproducibility
xgb_caret_model <- caret::train(
  x = train_features,      
  y = train_labels,
  method = "xgbTree",              # XGBoost method in caret
  trControl = train_control,       # Cross-validation settings
  tuneGrid = xgb_grid              # Grid of hyperparameters to search
  )

# Stop timer and display elapsed time
toc()
Training the model: 191.085 sec elapsed
# View the best parameters
kable(xgb_caret_model$bestTune, caption = "Best Parameters") 
Table 8: Best Parameters
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
100 6 0.05 1 0.9 3 0.9

As visualised in Figure 9, the tuned model’s predictions show a greater concentration of values between the 1st and 3rd quartiles, reflecting an improvement in capturing the target variable’s distribution.

# Predicted validation values using the trained model
validation_predictions_caret <- predict(xgb_caret_model, newdata = validation_features)

# Plot the density distribution for actual vs predicted values
plot_density_distribution(validation_labels, validation_predictions_caret) + 
    ggtitle ("`caret`-Tuned XGBoost Model: Density Distribution of the Target Variable")

Figure 9: Density distributions of the actual values of the target variable from the validation set and the values predicted by the XGBoost model tuned with caret.

# Generate summary statistics for the actual target variable
glance(summary(validation_labels)) |> 
    kable(
      caption = "Summary Statistics of the Actual Target Variable", 
      digits = 2) |> 
      kable_styling(
      position = "center",
    full_width = FALSE) 
Table 9: Summary Statistics of the Actual Target Variable
minimum q1 median mean q3 maximum
1 3.5 4 4.06 4.5 5
# Generate summary statistics for the predicted target variable
glance(summary(validation_predictions_caret)) |> 
    kable(
      caption = "Summary Statistics of the Predicted Target Variable", 
      digits = 2) |> 
      kable_styling(
      position = "center",
    full_width = FALSE) 
Table 10: Summary Statistics of the Predicted Target Variable
minimum q1 median mean q3 maximum
2.74 3.84 4.07 4.03 4.29 4.7

Evaluation metrics demonstrate a noticeable performance boost over the base model, with a more than 30% improvement in BIC and meaningful gains in R-squared. These improvements are particularly significant for complex datasets, where even small increases in model performance can provide valuable insights.

# Calculate evaluation metrics
evaluation_metrics(validation_labels, validation_predictions_caret, train_features) |> 
    # display the summary statistics in a table format
    kable(
      col.names = c("Evaluation Metric", "Value"), 
      caption = "Caret-Tuned Model Performance Metrics", 
      digits = 4) |> 
      kable_styling(
      position = "center",
    full_width = FALSE) 
Table 11: Caret-Tuned Model Performance Metrics
Evaluation Metric Value
RMSE 0.6627
MAE 0.4978
MAPE 14.8479
R-squared 0.2195
AIC -806.7621
BIC -510.3078

While tuning with caret significantly enhanced model accuracy and robustness, it remains computationally intensive, particularly with larger hyperparameter grids. This highlights the importance of balancing hyperparameter optimisation with computational efficiency to achieve reliable results without exceeding system capabilities.

Final Model Development with xgb.train()

The xgb.train() function from the xgboost package offers greater efficiency and flexibility than xgboost(), which was used for training the basic model. It supports advanced features such as early stopping, fine-grained control over training parameters, and enhanced handling of evaluation metrics. These capabilities make xgb.train() a more tailored and faster solution for gradient boosting tasks, especially compared to the computationally intensive tuning process with the caret package.

This model’s parameters were set based on the best-performing hyperparameters identified during tuning with caret. This approach leverages the exhaustive grid search and cross-validation performed by caret to establish a strong baseline for further refinement using xgb.train(). By adopting these pre-optimised parameters, the xgb.train() model benefits from the insights gained during the more intensive caret tuning process while achieving greater computational efficiency.

# Set a seed for reproducibility
set.seed(123)

# Start timer
tic("Training the model")

# Define the parameter list for XGBoost
param <- list(
  objective = "reg:squarederror",     # Objective function for regression
  eval_metric = "rmse",               # Evaluation metric: Root Mean Squared Error
  eta = 0.05,                         # Lower learning rate to prevent overfitting
  max_depth = 6,                      # Shallower trees to reduce complexity
  min_child_weight = 3,               # Conservative splits to control overfitting
  subsample = 0.9,                    # Randomly sample 90% of data
  colsample_bytree = 1,               # Use all features for training
  lambda = 1                          # L2 regularisation to prevent overfitting
)

# Cross-validate to find the optimal number of boosting rounds
cv_results <- xgb.cv(
  params = param,
  data = train_dmatrix,       
  nrounds = 200,                      # Number of boosting iterations
  nfold = 10,                         # 10-fold cross-validation
  early_stopping_rounds = 10,         # Stop if no improvement for 10 rounds
  verbose = FALSE                     # Do not print results for each fold
)

# Train the model with the best parameters found during cross-validation
xgb_model <- xgb.train(
  params = param,                      # Parameters for XGBoost
  data = train_dmatrix,       
  nrounds = cv_results$best_iteration  # Number of boosting rounds chosen by cross-validation
)

# Stop timer and display elapsed time
toc()
Training the model: 16.333 sec elapsed
# Print the trained model output
print(xgb_model)
##### xgb.Booster
raw: 465.1 Kb 
call:
  xgb.train(params = param, data = train_dmatrix, nrounds = cv_results$best_iteration)
params (as set within xgb.train):
  objective = "reg:squarederror", eval_metric = "rmse", eta = "0.05", max_depth = "6", min_child_weight = "3", subsample = "0.9", colsample_bytree = "1", lambda = "1", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.print.evaluation(period = print_every_n)
# of features: 59 
niter: 133
nfeatures : 59 

Evaluating the performance of this trained model on the validation set highlights the impact of tuning on predictive power.

# Predicted validation values using the trained model
validation_predictions <- predict(xgb_model, newdata = validation_dmatrix)

# Calculate evaluation metrics
evaluation_metrics(validation_labels, validation_predictions, train_features) |>     
    kable(
      col.names = c("Evaluation Metric", "Value"), 
      caption = "Final Model Performance Metrics", 
      digits = 4) |> 
      kable_styling(
      position = "center",
      full_width = FALSE) 
Table 12: Final Model Performance Metrics
Evaluation Metric Value
RMSE 0.6589
MAE 0.4915
MAPE 14.7264
R-squared 0.2265
AIC -819.6650
BIC -523.2107

The increase in R-squared from 0.2195 to 0.2265 indicates improved model performance in explaining variance within the data. Additionally, metrics such as RMSE, MAE, and MAPE show noticeable improvements, confirming that the model’s predictions are closer to the actual values. The decrease in AIC and BIC further validates that the tuned model achieves a better balance between complexity and fit, resulting in a more accurate representation of the underlying data patterns. Importantly, tuning with xgb.train() proved more computationally efficient than using the caret package, offering a streamlined approach to optimisation.

Figure 10, comparing actual and predicted values of the target variable, shows that the tuned model’s predictions exhibit a density distribution similar to that of the model tuned with caret, visually confirming its accuracy.

# Generate summary statistics for the actual target variable
glance(summary(validation_labels)) |> 
    kable(
      caption = "Summary Statistics of the Actual Target Variable", 
      digits = 2
    ) |> 
      kable_styling(
      position = "center",
    full_width = FALSE                            
  ) 
Table 13: Summary Statistics of the Actual Target Variable
minimum q1 median mean q3 maximum
1 3.5 4 4.06 4.5 5
# Generate summary statistics for the predicted target variable
glance(summary(validation_predictions)) |> 
    kable(
      caption = "Summary Statistics of the Predicted Target Variable", 
      digits = 2) |> 
      kable_styling(
      position = "center",
    full_width = FALSE) 
Table 14: Summary Statistics of the Predicted Target Variable
minimum q1 median mean q3 maximum
2.7 3.86 4.1 4.05 4.31 4.77
# Plot the density distribution for actual vs predicted values
plot_density_distribution(validation_labels, validation_predictions) + 
    ggtitle ("XGBoost `xgb.train` Model: Density Distribution of the Target Variable")

Figure 10: Density distributions of the actual values of the target variable from the validation set and the values predicted by the XGBoost model using xgb.train().

Reviewing the importance of features in this model in Figure 11 reveals notable changes compared to the base model. User Rating Count has emerged as the most significant driver of Average User Rating, surpassing app Size in influence, while Description Sentiment no longer ranks among the top three features.

# Generate SHAP summary plot
xgb.ggplot.shap.summary(
  model = xgb_model,
  data = as.matrix(train_features)) +
  ggtitle("SHAP Feature Importance of the XGBoost Model") +
  theme_tufte(base_family = "Aeonik") +
  theme(
    plot.title = element_text(hjust = 0.5)) +
  labs(
    x = "", 
    y = "SHAP value (impact on model output)",
    color = "Feature Value"
  ) 

Figure 11: The SHAP summary plot of the XGBoost model using xgb.train().

These results underscore the benefits of tuning with xgb.train(), not only in terms of computational efficiency but also in achieving a refined understanding of feature importance and enhancing the model’s predictive performance.

Model Performance on Test Data

As the final model demonstrated the best evaluation metrics on the validation data, applying it to the test dataset is essential for assessing its performance comprehensively. This step evaluates the models’ ability to generalise to unseen data, ensuring that the improvements observed during validation translate effectively to real-world scenarios.

Predictions on the test data across all three models revealed a slight decline in performance compared to the validation set, suggesting that the models fit the test data slightly less effectively.

# Predicted test values from the base XGBoost model
test_predictions_base <- predict(xgb_model_base, newdata = test_dmatrix)

# Predicted test values from the caret-tuned XGBoost model
test_predictions_caret <- predict(xgb_caret_model, newdata = test_features)

# Predicted test values from the final XGBoost model
test_predictions <- predict(xgb_model, newdata = test_dmatrix)

# Convert data frames to tables using kable for evaluation metrics
kable(evaluation_metrics(test_labels, test_predictions_base, train_features), caption = "Base Model's Evaluation on Test Data") |> 
      kable_styling(
      position = "center",
    full_width = FALSE) 
Table 15: Base Model’s Evaluation on Test Data
Metric Value
RMSE 0.7393065
MAE 0.5400284
MAPE 16.1484475
R-squared 0.1244816
AIC -560.9920841
BIC -264.5377913
kable(evaluation_metrics(test_labels, test_predictions_caret, train_features), caption = "Caret-Tuned Model's Evaluation on Test Data") |> 
      kable_styling(
      position = "center",
    full_width = FALSE) 
Table 16: Caret-Tuned Model’s Evaluation on Test Data
Metric Value
RMSE 0.6844488
MAE 0.5078527
MAPE 15.2016042
R-squared 0.1863303
AIC -734.3097936
BIC -437.8555008
kable(evaluation_metrics(test_labels, test_predictions, train_features), caption = "Final Model's Evaluation on Test Data") |> 
      kable_styling(
      position = "center",
    full_width = FALSE) 
Table 17: Final Model’s Evaluation on Test Data
Metric Value
RMSE 0.6861636
MAE 0.5053039
MAPE 15.2287369
R-squared 0.1831256
AIC -728.6848398
BIC -432.2305470

However, the evaluation metrics remain within acceptable ranges, indicating that all models generalise reasonably well to new data. Among them, the caret-tuned model performed slightly better on the test data than the final XGBoost model, though the differences are minor. The trade-off lies in computational efficiency: while the caret-tuned model achieved marginally better results, it required significantly more time and resources during training and tuning. In contrast, the final model offers a practical balance, delivering robust performance with greater computational efficiency, making it an attractive choice for resource-constrained applications.

The slight decline in performance on the test data can be attributed to nuanced differences in patterns that were not fully captured during training and validation, despite stratified sampling. This is common in machine learning, particularly with complex models like XGBoost, which are sensitive to subtle variations in data. Additionally, the tuning process likely optimised performance for the validation set, introducing minor overfitting to its specific characteristics.

Nevertheless, the model’s consistent performance across datasets demonstrates its ability to capture overall trends effectively. These results emphasise the importance of stratified sampling in maintaining dataset consistency and achieving a balance between model complexity and predictive performance. To further enhance robustness and generalisability, future efforts should incorporate additional cross-validation techniques to better capture variability across datasets and minimise overfitting. Regularisation adjustments and advanced tuning strategies, such as automated hyperparameter optimisation or ensembling with complementary algorithms, could yield further improvements and make the model more resilient to variations in unseen data.

Conclusions

The data processing, exploratory data analysis (EDA), and model development phases provided valuable insights into the factors influencing Average User Rating and revealed key challenges in modelling this variable.

References

Cabello-Solorzano, K., Ortigosa de Araujo, I., Peña, M., Correia, L. and Tallón-Ballesteros, A.J. (2023) ‘The Impact of Data Normalization on the Accuracy of Machine Learning Algorithms: A Comparative Analysis’, 18th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2023). Lecture Notes in Networks and Systems, vol. 750. Springer, Cham. Available at: https://doi.org/10.1007/978-3-031-42536-3_33.

Conort, X. (2013) ‘Q&A with Xavier Conort’. Interviewed by Kaggle Team for Kaggle, 10 April. Available at: https://web.archive.org/web/20190609154949/http://blog.kaggle.com/2013/04/10/qa-with-xavier-conort/ (Accessed: 4 December 2024).

Gupta, A., Stead, T.S. and Ganti, L. (2024) ‘Determining a Meaningful R-squared Value in Clinical Medicine’, Academic Medicine & Surgery. Available at: https://doi.org/10.62186/001c.125154.

Lantz, B. (2023) Machine Learning with R. 4th edn. Packt Publishing Ltd. Available at: https://learning.oreilly.com/library/view/machine-learning-with/9781801071321/ (Accessed: 4 December 2024).

Appendix 1: Session Info

# This command provides detailed information about the current R session, including the version of R, the operating system, and the list of loaded packages along with their versions. It is useful for checking compatibility of packages and diagnosing issues with installed software.
sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS 26.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] caret_7.0-1        lattice_0.22-6     Metrics_0.1.4      tictoc_1.2.1      
 [5] xgboost_1.7.11.1   reshape2_1.4.4     GGally_2.2.1       ggridges_0.5.6    
 [9] syuzhet_1.0.7      textrecipes_1.1.0  recipes_1.3.1      rsample_1.3.0     
[13] scales_1.4.0       broom_1.0.8        knitr_1.50         kableExtra_1.4.0  
[17] ggthemes_5.1.0     RColorBrewer_1.1-3 plotly_4.10.4      lubridate_1.9.4   
[21] forcats_1.0.0      stringr_1.5.1      purrr_1.0.4        readr_2.1.5       
[25] tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.2      tidyverse_2.0.0   
[29] dplyr_1.1.4       

loaded via a namespace (and not attached):
 [1] pROC_1.18.5          rlang_1.1.6          magrittr_2.0.3      
 [4] furrr_0.3.1          compiler_4.5.0       systemfonts_1.2.3   
 [7] vctrs_0.6.5          xray_0.2             crayon_1.5.3        
[10] pkgconfig_2.0.3      fastmap_1.2.0        backports_1.5.0     
[13] labeling_0.4.3       rmarkdown_2.29       prodlim_2025.04.28  
[16] tzdb_0.5.0           bit_4.6.0            xfun_0.52           
[19] cachem_1.1.0         jsonlite_2.0.0       SnowballC_0.7.1     
[22] parallel_4.5.0       R6_2.6.1             bslib_0.9.0         
[25] stringi_1.8.7        parallelly_1.44.0    rpart_4.1.24        
[28] jquerylib_0.1.4      Rcpp_1.0.14          bookdown_0.43       
[31] iterators_1.0.14     future.apply_1.11.3  Matrix_1.7-3        
[34] splines_4.5.0        nnet_7.3-20          timechange_0.3.0    
[37] tidyselect_1.2.1     rstudioapi_0.17.1    yaml_2.3.10         
[40] timeDate_4041.110    codetools_0.2-20     listenv_0.9.1       
[43] plyr_1.8.9           withr_3.0.2          evaluate_1.0.3      
[46] future_1.49.0        survival_3.8-3       ggstats_0.9.0       
[49] xml2_1.3.8           pillar_1.10.2        foreach_1.5.2       
[52] stats4_4.5.0         generics_0.1.4       vroom_1.6.5         
[55] hms_1.1.3            globals_0.18.0       class_7.3-23        
[58] glue_1.8.0           lazyeval_0.2.2       tools_4.5.0         
[61] tokenizers_0.3.0     data.table_1.17.4    ModelMetrics_1.2.2.2
[64] gower_1.0.2          grid_4.5.0           crosstalk_1.2.1     
[67] ipred_0.9-15         nlme_3.1-168         cli_3.6.5           
[70] textshaping_1.0.1    viridisLite_0.4.2    svglite_2.2.1       
[73] lava_1.8.1           gtable_0.3.6         sass_0.4.10         
[76] digest_0.6.37        htmlwidgets_1.6.4    farver_2.1.2        
[79] htmltools_0.5.8.1    lifecycle_1.0.4      hardhat_1.4.1       
[82] httr_1.4.7           bit64_4.6.0-1        MASS_7.3-65